K Nearest Neighbors Project¶

I've been given a classified data set from a company! They've hidden the feature column names but have given me the data and the target classes.

** We'll try to use KNN to create a model that directly predicts a class for a new data point based on the features. **

Import Libraries¶

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Get the Data¶

** Read the 'KNN_Project_Data.csv' file into a dataframe. **

In [2]:
df = pd.read_csv('KNN_Project_Data.csv')

Check the head of the dataframe.

In [3]:
df.head() 
Out[3]:
XVPM GWYH TRAT TLLZ IGGA HYKR EDFS GUUB MGJM JHZC TARGET CLASS
0 1636.670614 817.988525 2565.995189 358.347163 550.417491 1618.870897 2147.641254 330.727893 1494.878631 845.136088 0
1 1013.402760 577.587332 2644.141273 280.428203 1161.873391 2084.107872 853.404981 447.157619 1193.032521 861.081809 1
2 1300.035501 820.518697 2025.854469 525.562292 922.206261 2552.355407 818.676686 845.491492 1968.367513 1647.186291 1
3 1059.347542 1066.866418 612.000041 480.827789 419.467495 685.666983 852.867810 341.664784 1154.391368 1450.935357 0
4 1018.340526 1313.679056 950.622661 724.742174 843.065903 1370.554164 905.469453 658.118202 539.459350 1899.850792 0

EDA¶

Since this data is artificial, we'll just do a large pairplot with seaborn.

Use seaborn on the dataframe to create a pairplot with the hue indicated by the TARGET CLASS column.

In [4]:
# THIS IS GOING TO BE A VERY LARGE PLOT
sns.pairplot(df,hue='TARGET CLASS',palette='coolwarm')
Out[4]:
<seaborn.axisgrid.PairGrid at 0x1c4950f3150>
[Figure: seaborn pairplot of all ten features, colored by TARGET CLASS]

Standardize the Variables¶

Time to standardize the variables.

** Import StandardScaler from Scikit learn.**

In [5]:
from sklearn.preprocessing import StandardScaler

** Create a StandardScaler() object called scaler.**

In [6]:
scaler = StandardScaler()

** Fit scaler to the features.**

In [7]:
scaler.fit(df.drop('TARGET CLASS',axis=1))
Out[7]:
StandardScaler()

Use the .transform() method to transform the features to a scaled version.

In [8]:
scaled_features = scaler.transform(df.drop('TARGET CLASS',axis=1))
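
As a rough illustration of what the scaler computes: with its default settings, StandardScaler subtracts each column's mean and divides by that column's population standard deviation. The sketch below reproduces that arithmetic for a single column in plain Python. The helper name `standardize` and the sample values are assumptions for illustration only; they are not taken from the project data.

```python
from statistics import fmean, pstdev

def standardize(column):
    """Return z-scores: (x - mean) / population standard deviation."""
    mu = fmean(column)
    sigma = pstdev(column)  # population std (ddof=0), the convention StandardScaler uses
    return [(x - mu) / sigma for x in column]

# Hypothetical values for one feature column
raw = [2.0, 4.0, 6.0, 8.0]
scaled = standardize(raw)

print([round(z, 3) for z in scaled])
print(round(fmean(scaled), 6), round(pstdev(scaled), 6))  # mean ~0, std ~1 after scaling
```

After scaling, every feature has mean 0 and unit variance, so no single large-valued column dominates the distance calculation that KNN relies on.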

Convert the scaled features to a dataframe and check the head of this dataframe to make sure the scaling worked.

In [9]:
df_feat = pd.DataFrame(scaled_features,columns=df.columns[:-1])
df_feat.head()
Out[9]:
XVPM GWYH TRAT TLLZ IGGA HYKR EDFS GUUB MGJM JHZC
0 1.568522 -0.443435 1.619808 -0.958255 -1.128481 0.138336 0.980493 -0.932794 1.008313 -1.069627
1 -0.112376 -1.056574 1.741918 -1.504220 0.640009 1.081552 -1.182663 -0.461864 0.258321 -1.041546
2 0.660647 -0.436981 0.775793 0.213394 -0.053171 2.030872 -1.240707 1.149298 2.184784 0.342811
3 0.011533 0.191324 -1.433473 -0.100053 -1.507223 -1.753632 -1.183561 -0.888557 0.162310 -0.002793
4 -0.099059 0.820815 -0.904346 1.609015 -0.282065 -0.365099 -1.095644 0.391419 -1.365603 0.787762

Train Test Split¶

Use train_test_split to split your data into a training set and a testing set.

In [10]:
from sklearn.model_selection import train_test_split
In [11]:
X_train, X_test, y_train, y_test = train_test_split(scaled_features,df['TARGET CLASS'],
                                                    test_size=0.30)
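
Note that `train_test_split` shuffles the rows randomly, so without a fixed `random_state` the split changes from run to run, and the scores reported in this notebook change with it. The plain-Python sketch below mimics a seeded 70/30 shuffle split to show why a fixed seed makes results repeatable. The function `split_indices`, the seed value, and the row count of 1000 are assumptions for illustration (1000 is consistent with the 300 test rows reported later); this is not scikit-learn's actual implementation.

```python
import math
import random

def split_indices(n_rows, test_size, seed):
    """Shuffle row indices with a fixed seed and cut off a test portion."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_rows * test_size)
    return indices[n_test:], indices[:n_test]  # (train indices, test indices)

# Same seed twice: the two splits are identical
train_a, test_a = split_indices(1000, 0.30, seed=101)
train_b, test_b = split_indices(1000, 0.30, seed=101)

print(len(train_a), len(test_a))
print(test_a == test_b)
```

In the notebook itself, the equivalent step would be passing an integer `random_state` to `train_test_split`.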

Using KNN¶

Import KNeighborsClassifier from scikit learn.

In [12]:
from sklearn.neighbors import KNeighborsClassifier

Create a KNN model instance with n_neighbors=1

In [13]:
knn = KNeighborsClassifier(n_neighbors=1)
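
For intuition, the core idea behind a k-nearest-neighbors classifier can be sketched in a few lines of plain Python: measure the distance from the new point to every training point, take the k closest, and let them vote. The function `knn_predict` and the toy points below are hypothetical and only illustrate the idea; scikit-learn's `KNeighborsClassifier` uses more efficient neighbor searches.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=1):
    """Predict a label by majority vote among the k closest training points."""
    # Pair each training point's Euclidean distance to the query with its label
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical 2-D data: class 0 near (0, 0), class 1 near (1, 1),
# plus one class-0 point sitting inside the class-1 cluster (noise)
points = [(0.0, 0.0), (0.1, 0.2), (0.9, 0.9), (1.0, 1.0), (0.9, 1.1)]
labels = [0, 0, 0, 1, 1]

query = (0.85, 0.9)
print(knn_predict(points, labels, query, k=1))  # follows the single noisy neighbor
print(knn_predict(points, labels, query, k=3))  # the vote outweighs the noisy neighbor
```

With k=1 the prediction depends entirely on one neighbor, which makes the model sensitive to noisy points. That is one reason to try larger K values later on.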

Fit this KNN model to the training data.

In [14]:
knn.fit(X_train,y_train)
Out[14]:
KNeighborsClassifier(n_neighbors=1)

Predictions and Evaluations¶

Let's evaluate our KNN model!

Use the predict method to predict values using your KNN model and X_test.

In [15]:
pred = knn.predict(X_test)

** Create a confusion matrix and classification report.**

In [16]:
from sklearn.metrics import classification_report,confusion_matrix
In [17]:
print(confusion_matrix(y_test,pred))
[[105  56]
 [ 38 101]]
In [18]:
print(classification_report(y_test,pred))
              precision    recall  f1-score   support

           0       0.73      0.65      0.69       161
           1       0.64      0.73      0.68       139

    accuracy                           0.69       300
   macro avg       0.69      0.69      0.69       300
weighted avg       0.69      0.69      0.69       300
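
To see how the report relates to the confusion matrix, the figures can be recomputed by hand. In scikit-learn's confusion matrix, rows are the true classes and columns are the predicted classes. The sketch below uses the K=1 matrix printed above; the helper name `per_class_metrics` is an assumption for illustration.

```python
# Confusion matrix from the K=1 run: rows are true classes, columns are predictions
cm = [[105, 56],
      [38, 101]]

def per_class_metrics(cm, cls):
    """Precision, recall, and F1 for one class of a square confusion matrix."""
    true_pos = cm[cls][cls]
    predicted = sum(row[cls] for row in cm)  # column sum: everything predicted as cls
    actual = sum(cm[cls])                    # row sum: everything truly cls (the support)
    precision = true_pos / predicted
    recall = true_pos / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Accuracy: correct predictions (the diagonal) over all predictions
accuracy = sum(cm[i][i] for i in range(len(cm))) / sum(map(sum, cm))

for cls in range(len(cm)):
    p, r, f = per_class_metrics(cm, cls)
    print(cls, round(p, 2), round(r, 2), round(f, 2))
print("accuracy", round(accuracy, 2))
```

The rounded values match the precision, recall, f1-score, and accuracy columns of the classification report.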

Choosing a K Value¶

Let's go ahead and use the elbow method to pick a good K Value!

** Create a for loop that trains KNN models with different k values and keeps track of each model's error rate in a list. Refer to the lecture if you are confused about this step. **

In [19]:
error_rate = []

# Will take some time
for i in range(1,40):
    
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train,y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))
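
The expression `np.mean(pred_i != y_test)` is the fraction of test rows whose prediction differs from the true label. The plain-Python sketch below spells that out and shows how a list index maps back to K, since the loop starts at 1. The labels and error values are made up for illustration, and taking the minimum is only one possible way to read the elbow plot.

```python
def error_rate(predicted, actual):
    """Fraction of positions where the prediction differs from the true label."""
    mismatches = sum(p != a for p, a in zip(predicted, actual))
    return mismatches / len(actual)

# Hypothetical labels: 2 of 8 predictions are wrong
y_true = [0, 1, 1, 0, 1, 0, 0, 1]
y_pred = [0, 1, 0, 0, 1, 1, 0, 1]
print(error_rate(y_pred, y_true))  # 0.25

# Hypothetical error rates for K = 1..5 (index 0 corresponds to K=1)
errors = [0.31, 0.30, 0.24, 0.22, 0.23]
best_k = min(range(len(errors)), key=errors.__getitem__) + 1
print(best_k)  # 4
```

Because K is chosen by looking at test-set error here, the final test score is likely somewhat optimistic; a separate validation set or cross-validation would avoid that.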

Now create the following plot using the information from your for loop.

In [20]:
plt.figure(figsize=(10,6))
plt.plot(range(1,40),error_rate,color='blue', linestyle='dashed', marker='o',
         markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')
Out[20]:
Text(0, 0.5, 'Error Rate')
[Figure: line plot of error rate versus K for K = 1 to 39, titled "Error Rate vs. K Value"]

Retrain with new K Value¶

Retrain your model with the best K value (up to you to decide what you want) and re-do the classification report and the confusion matrix.

In [21]:
# NOW WITH K=30
knn = KNeighborsClassifier(n_neighbors=30)

knn.fit(X_train,y_train)
pred = knn.predict(X_test)

print('WITH K=30')
print('\n')
print(confusion_matrix(y_test,pred))
print('\n')
print(classification_report(y_test,pred))
WITH K=30


[[124  37]
 [ 19 120]]


              precision    recall  f1-score   support

           0       0.87      0.77      0.82       161
           1       0.76      0.86      0.81       139

    accuracy                           0.81       300
   macro avg       0.82      0.82      0.81       300
weighted avg       0.82      0.81      0.81       300

Export the Notebook as HTML¶

In [22]:
from nbconvert import HTMLExporter
import nbformat
exporter = HTMLExporter()
notebook_path = 'K-Nearest-Neighbors-Project.ipynb'
notebook = nbformat.read(notebook_path, as_version=4)
html_output, resources = exporter.from_notebook_node(notebook)
output_file = 'K-Nearest-Neighbors-Notebook.html'
with open(output_file, 'w', encoding='utf-8') as f:
    f.write(html_output)

That's all for now!¶